Chapter 1: Welcome to jahapaula’s data science project

Here we are at 8 am in the first lecture of the data science course. For me this is an attempt to dedicate two hours per week to actually working with R and git, and to get my skills to a more stable level. So far I’ve just been clicking around and editing code made by other people without really understanding what I’m doing.

All the action will be taking place in my GitHub repository for this course.


Chapter 2: Playing around with regression analysis

Starting off by loading all the packages needed for this analysis

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa

Data

This dataset is based on a survey collected from the students of the Introduction to Social Statistics course in fall 2014, and it investigates the students’ Approaches to Learning (for a previous study, see Vehkalahti 2015). The survey data includes responses from 166 participants (110 female, 56 male, mean age = 25.51) who took the final exam and received at least one point.

students2014 <- read.csv("data/lrn14_final.csv", row.names = 1, header=TRUE)
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166   7

For the purposes of this study, several variables measuring learning attitudes were computed as combined variables reflecting deep learning (deep), strategic learning (stra), and surface learning (surf; see Appendix A for the full survey questionnaire). All these variables are measured on a Likert scale of 1 to 5 (see Table 1).
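The combined variables were computed by averaging the related survey items. A minimal sketch with hypothetical item names (the actual item groupings come from the original questionnaire, not from this document):

```r
# Hypothetical example of forming a combined variable: average a group of
# Likert items into one score. The names d06, d15, d23 are placeholders
# for questions belonging to the "deep learning" group.
lrn14 <- data.frame(d06 = c(4, 3, 5), d15 = c(3, 4, 4), d23 = c(5, 3, 4))
deep_questions <- c("d06", "d15", "d23")
lrn14$deep <- rowMeans(lrn14[, deep_questions])
lrn14$deep
```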

summary(students2014)
##  gender       Age           Attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           Points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00
# descriptive statistics would be given in a table here.

[Table 1 here]
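One way such a descriptive table could be built in base R, shown here with a tiny synthetic stand-in for students2014:

```r
# Sketch of how Table 1 (descriptive statistics) could be produced:
# means and standard deviations per variable. The data frame here is
# synthetic stand-in data, not the real students2014.
students <- data.frame(Points = c(25, 12, 24), Attitude = c(37, 31, 25))
desc <- data.frame(mean = sapply(students, mean), sd = sapply(students, sd))
round(desc, 2)
```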

Method

The cross-correlation plot drawn with ggpairs shows that the largest correlation coefficient is between the variables Points and Attitude (.437), and that there is a negative correlation between deep and surface learning (-0.324). Some interesting differences in the distributions can be observed by gender in Attitude and surface learning. These would require further statistical examination, which is not within the scope of this paper.

pic <- ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
pic

Next, a multiple linear regression was used to investigate the relationship between learning styles and final exam points. Based on the cross-correlation plot, three variables were selected for the first model: attitude, strategic learning, and surface learning (the last of which was expected to have a negative effect).

model1 <- lm(Points ~ Attitude + stra + surf, data = students2014)
summary(model1)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

[Table 2 here]

Model 1 shows that attitude positively influences final exam points (beta 0.33952, p<.001), but the effect size is not very big. In this model, strategic learning and surface learning proved non-significant. The explanatory power (adjusted R-squared) of the model is 0.19, which means that the model explains about 19% of the variation in the dependent variable Points. This R-squared is very low even for the social sciences.

Therefore, two further models were tested. For Model 2, the explanatory variable surf was excluded: it had the smaller effect and a larger p value in Model 1, and its correlation coefficient in the ggpairs plot was lower compared to stra.

model2 <- lm(Points ~ Attitude + stra, data = students2014)
summary(model2)
## 
## Call:
## lm(formula = Points ~ Attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## Attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

In Model 2 the positive effect of attitude on final exam points remains (p<.001). In this model strategic learning also shows a positive effect on exam success (p<0.1). However, the explanatory power of the model is still very low (adjusted R-squared 0.20).

Model 3, with attitude and surface learning style as the independent variables, was also tested, but surface learning did not add explanatory power and the R-squared was lower than for both Models 1 and 2. Hence, Model 2 was chosen as the final model, which shows that students’ performance in the final exam is explained by their attitude, measured as their global attitude toward statistics.
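The code for Model 3 was not printed above; it would follow the same lm pattern. A sketch with tiny synthetic stand-in data (with the real data the call would be lm(Points ~ Attitude + surf, data = students2014)):

```r
# Sketch of Model 3 (attitude + surface learning as predictors).
# The data frame d below is synthetic stand-in data, not students2014.
set.seed(1)
d <- data.frame(Attitude = rnorm(20, 31, 6), surf = rnorm(20, 2.8, 0.5))
d$Points <- 10 + 0.3 * d$Attitude + rnorm(20, 0, 5)
model3 <- lm(Points ~ Attitude + surf, data = d)
coef(model3)
```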

Finally, three diagnostic plots were produced to explore the validity of the model: Residuals vs Fitted values, Normal QQ-plot, and Residuals vs Leverage.

plot(model2, which = c(1, 2, 5))

First, regression analysis assumes that the size of the errors in the model does not depend on the independent variables. The scatter plot of residuals vs. fitted values shows that the errors have a constant variance, as no pattern is found in the plot apart from a few outliers in the middle.

Second, the QQ plot shows that the residuals follow the reference line reasonably well, suggesting that the errors are normally distributed.

Third, the residuals versus leverage plot shows that there are no individual observations that would distort the model, as all observations are grouped closely together and no clear outliers exist.

Hence, to conclude, all assumptions required for a regression analysis to be valid are met with this dataset and Model 2.

Conclusion

In the regression analysis conducted above, the relationship between students’ exam performance and their learning strategies and general attitude towards statistics was studied using multiple linear regression. The findings indicate that the most important variable explaining students’ performance in the final exam is their attitude towards statistics. The adoption of strategic learning styles, such as planning one’s study week, following guidelines, and organizing one’s time carefully, also shows a minor effect on exam performance. However, the explanatory power of the model is not very high, so it is probable that there are other factors not measured in this study that would better explain high points in the final exam.

Nevertheless, based on this investigation, a practical conclusion would be: learn to love statistics to get better grades!


Chapter 3: Playing around with logistic regression analysis

Again starting off by loading all the packages needed for this analysis

library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)

Data

The dataset at hand is the Student Alcohol Consumption Data Set (see [1] and the data information). The original survey data was collected in two courses in two secondary schools. For this study, the data from two different surveys were combined: one from the mathematics course and one from the Portuguese course. Only the students who answered the survey in both courses were included in the analysis, joining the two datasets on the background variables. The final dataset includes 382 respondents and 35 variables, listed below in Table 1.

# Table 1
alc <- read.csv("data/alc_final.csv", row.names = 1, header=TRUE)
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The purpose of this analysis is to study the predictors of high/low alcohol consumption. High alcohol consumption is defined as taking more than two portions per week on average. We start the exploration with the following set of hypotheses.

First, students with more absences or more past class failures are likely to behave badly in other ways too (REF) and hence drink more alcohol.

H1. The number of school absences predicts high alcohol consumption.

H2. The number of past class failures predicts high alcohol consumption.

Second, alcohol consumption is often related to social activities (REF, REF).

H3. A higher reported value for going out with friends predicts high alcohol consumption.

Finally, based on previous national-level studies (REF), male students are more likely to consume high amounts of alcohol than female students. Hence:

H4. Male gender predicts high alcohol consumption.
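Before testing these, a note on how high_use itself was presumably derived (an assumption: alc_use is the average of the weekday and weekend consumption items, Dalc and Walc, and high_use flags averages above two), shown with tiny stand-in data:

```r
# Assumption: alc_use is the average of weekday (Dalc) and weekend (Walc)
# consumption, and high_use flags averages above two portions.
# alc_demo is stand-in data, not the real alc dataset.
alc_demo <- data.frame(Dalc = c(1, 3, 2), Walc = c(2, 4, 2))
alc_demo$alc_use <- (alc_demo$Dalc + alc_demo$Walc) / 2
alc_demo$high_use <- alc_demo$alc_use > 2
alc_demo
```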

selected_columns <- c("sex","absences","failures","goout","high_use")
alc_sel <- select(alc, one_of(selected_columns))

Distributions

First let’s investigate the distributions of the variables using gather and ggplot. To be honest, I’m not sure why the gather function is needed here.

gather(alc_sel) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables; they will
## be dropped
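As it turns out, gather is what makes the faceting work: it reshapes the data from wide to long format, stacking every column into key-value pairs so that facet_wrap("key") can draw one panel per variable. A minimal illustration on toy data (not the actual alc_sel):

```r
library(tidyr)
# gather() stacks all columns into key-value pairs (long format),
# which is what facet_wrap("key") needs to draw one panel per variable.
wide <- data.frame(absences = c(0, 4), goout = c(3, 5))
long <- gather(wide)
long
```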

Next, the selected variables are compared to the variable that classifies the students who take over two doses of alcohol per week as high users of alcohol.

fig1 <- ggplot(alc_sel, aes(x = failures, fill = high_use))
fig1 + geom_bar() + xlab("failures") + ggtitle("Figure 1. Students' previous failures by alcohol consumption")

Figure 1 shows students’ previous course failures by alcohol consumption. The distribution is heavily skewed towards zero, so it is difficult to draw any conclusions from the figure alone. The majority of students without failures are not high users of alcohol, however.

fig2 <- ggplot(alc, aes(x = high_use, y = absences))
fig2 + geom_boxplot() + ylab("absences") + ggtitle("Figure 2. Student absences by alcohol consumption")

Figure 2 shows student absences by alcohol consumption. We can observe that high users of alcohol have larger numbers of absences.

fig3 <- ggplot(alc_sel, aes(x = goout, fill = high_use))
fig3 + geom_bar() + xlab("going out") + ggtitle("Figure 3. Students' reported going out with friends by alcohol consumption")

Figure 3 shows students’ reported going out with friends by alcohol consumption. It looks like students who are high users of alcohol also go out more with their friends. The majority of respondents report going out moderately.

fig4 <- ggplot(data = alc, aes(x = high_use))
fig4 + geom_bar() + facet_wrap("sex") + ggtitle("Figure 4. Students' alcohol consumption by sex")

Figure 4 shows the distributions of high and low alcohol users by sex. We can observe that among males there are more high users, both in absolute numbers and proportionally.

Based on these graphs it is reasonable to proceed with these variables. The only exception is previous failures, which is difficult to interpret from the plot.

Method

In order to test the hypotheses logistic regression was used.

# fit the model
m1 <- glm(high_use ~ goout + failures + absences + sex, data = alc, family = "binomial")
summary(m1)
## 
## Call:
## glm(formula = high_use ~ goout + failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0299  -0.7890  -0.5417   0.7857   2.3588  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.13390    0.47292  -8.741  < 2e-16 ***
## goout        0.70797    0.11998   5.901 3.62e-09 ***
## failures     0.34560    0.21093   1.638 0.101330    
## absences     0.08243    0.02235   3.688 0.000226 ***
## sexM         0.91999    0.25625   3.590 0.000330 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 385.06  on 377  degrees of freedom
## AIC: 395.06
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios confidence intervals
oddra <- coef(m1) %>% exp
coin <- confint(m1) %>% exp
## Waiting for profiling to be done...
cbind(oddra, coin)
##                  oddra       2.5 %     97.5 %
## (Intercept) 0.01602022 0.006086805 0.03902357
## goout       2.02986872 1.613471240 2.58531654
## failures    1.41283378 0.934759890 2.14797583
## absences    1.08592493 1.040580166 1.13727912
## sexM        2.50925591 1.526894419 4.17886012

The regression model confirms that three of the four variables are significant predictors of high alcohol use: absences, going out with friends, and male sex. Hence, hypotheses H1, H3 and H4 can be accepted. H2, which suggested that previous course failures would predict high alcohol use, must be rejected.

Odds ratios for all explanatory variables are above one. Students who go out with friends more are about two times more likely to drink high doses of alcohol, and male students about 2.5 times more likely. The odds ratio of absences is only slightly above one at 1.09, but its confidence interval nevertheless stays above one; hence the positive effect is small, but it exists.
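As a sanity check, an odds ratio is simply the exponentiated regression coefficient (the model works on the log-odds scale). For example, for the goout estimate from the output above:

```r
# An odds ratio is the exponentiated log-odds coefficient; here the goout
# estimate 0.70797 is taken from the model summary above.
or_goout <- exp(0.70797)
round(or_goout, 2)
```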

For the final model the non-significant predictor of failures is removed.

# fit the new model
m2 <- glm(high_use ~ goout + absences + sex, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ goout + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7871  -0.8153  -0.5446   0.8357   2.4740  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.16317    0.47506  -8.764  < 2e-16 ***
## goout        0.72981    0.11970   6.097 1.08e-09 ***
## absences     0.08418    0.02237   3.764 0.000167 ***
## sexM         0.95872    0.25459   3.766 0.000166 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 387.75  on 378  degrees of freedom
## AIC: 395.75
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios confidence intervals
oddra <- coef(m2) %>% exp
coin <- confint(m2) %>% exp
## Waiting for profiling to be done...
cbind(oddra, coin)
##                  oddra       2.5 %     97.5 %
## (Intercept) 0.01555815 0.005885392 0.03804621
## goout       2.07468346 1.650182481 2.64111050
## absences    1.08782902 1.042458467 1.13933894
## sexM        2.60835292 1.593132148 4.33151387

Next, let’s use the model to compute predictions for the data.

# predict() the probability of high_use
probabilities <- predict(m2, type = "response")
alc_sel <- mutate(alc_sel, probability = probabilities)
alc_sel <- mutate(alc_sel, prediction = probability > 0.5)
table(high_use = alc_sel$high_use, prediction = alc_sel$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   253   15
##    TRUE     65   49

Hence, with this model on the current dataset we end up with 65 + 15 incorrectly classified students - stupid datapoints that refuse to follow the model!

# a plot of high_use versus probability, prediction as colors
# g <- ggplot(alc_sel, aes(x = probability, y = high_use, col = prediction))
# g + geom_point()
# OMITTING THE PLOT BECAUSE I DON'T FIND IT VERY USEFUL OR ILLUMINATING

# tabulate the target variable versus the predictions
table(high_use = alc_sel$high_use, prediction = alc_sel$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.66230366 0.03926702 0.70157068
##    TRUE  0.17015707 0.12827225 0.29842932
##    Sum   0.83246073 0.16753927 1.00000000
# ripping the loss function from datacamp
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5 # TRUE when the prediction went wrong
  mean(n_wrong)
}

# call loss_func to compute the average proportion of wrong predictions in the (training) data
loss_func(class = alc_sel$high_use, prob = alc_sel$probability)
## [1] 0.2094241

This computed average proportion of incorrect predictions means that approximately 21% of the predictions were incorrect under model m2. This training error is lower than the error achieved e.g. by guessing all students to be low-use alcohol consumers (30%). Hence the performance of the model is not perfect, but it is better than a naive guess. More variable testing would be needed to make the model more accurate.
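The baseline error quoted above can be computed directly: always guessing "not a high user" misclassifies exactly the high users. A small check using the counts from the earlier cross-tabulation (65 + 49 = 114 high users out of 382 students):

```r
# Baseline check: guessing "low use" for everyone gives an error equal to
# the share of high users; counts taken from the confusion table above.
baseline_error <- 114 / 382
round(baseline_error, 2)
```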

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc_sel, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2251309

Performing 10-fold cross-validation on the model shows that the mean prediction error for testing data is also quite close to the original training error.

Conclusion

In the logistic regression analysis conducted above, the aim was to study the predictors of high alcohol consumption. Four hypotheses were formulated based on previous research (REF REF). The results show that a high number of school absences, a higher reported value for going out with friends, and male sex predict high alcohol consumption.

References:

[1]. P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.


Chapter 4: Clustering and classification

Again starting off by loading all the packages needed for this analysis

# access the MASS package and load the data
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)
library(ggplot2)
library(GGally)
data("Boston")

Data

The dataset used in this analysis describes housing values in the suburbs of Boston. It is freely available in the MASS package (more information here).

The dataset has 506 observations and 14 columns.

# explore data dimension and structures
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Next, let’s use pairs to get a graphical overview of the data. Some variables seem to correlate positively or negatively, though not necessarily linearly. In this analysis we are interested in the crime rates in particular. In the graph it seems that certain variables are heavily clustered by low or high crime rates, such as chas (Charles River dummy variable), rad (index of accessibility to radial highways), the tax rate, and the proportion of blacks by town. However, some statistical analysis is required to show whether this perception is actually true.

# graphical overview of the data
pairs(Boston)

Next, let’s standardize the dataset for the LDA analysis. Scaling subtracts the column mean from each column and divides the difference by the column’s standard deviation. We end up with variables that all have a mean of zero and a standard deviation of one.
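The effect of scale() can be verified against the manual formula (x − mean) / sd on a single vector:

```r
# scale() subtracts the column mean and divides by the column standard
# deviation; verified here against the manual formula on one vector.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)
```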

# center and standardize variables
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled) # otherwise it stays a matrix

Linear Discriminant Analysis LDA

Next, we create a new categorical variable from the crime variable (crim), using the quantiles (quartiles) as break points. This new variable replaces the continuous crime variable in the dataset.

# save the scaled crim as scaled_crim
scaled_crim <- boston_scaled$crim
# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Next, for the purposes of the following analysis, the dataset is divided into training and testing sets, so that 80% of the data belongs to the training set.

n <- nrow(boston_scaled) # gives number of rows
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create training and testing sets
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Next, we fit a linear discriminant analysis model on the training dataset, trying to predict the categorical crime variable using all other variables as predictors.

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2450495 0.2326733 0.2648515 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       1.0330429 -0.8837219 -0.12090214 -0.8763706  0.42404678
## med_low  -0.1089253 -0.3194922 -0.07348562 -0.5976183 -0.09328477
## med_high -0.3859772  0.1232999  0.31404756  0.3532397  0.24048943
## high     -0.4872402  1.0170108 -0.05155709  1.0904690 -0.40257869
##                 age        dis        rad        tax     ptratio
## low      -0.8603264  0.8977447 -0.6881287 -0.7257968 -0.43559230
## med_low  -0.4035598  0.4120579 -0.5422055 -0.5042487 -0.03404955
## med_high  0.4543320 -0.3605430 -0.4100812 -0.3294490 -0.29837200
## high      0.7790311 -0.8457203  1.6392096  1.5148289  0.78203563
##                black       lstat        medv
## low       0.38312572 -0.74957690  0.48268553
## med_low   0.31204593 -0.19715720  0.01027132
## med_high  0.06323572 -0.02043612  0.22268365
## high     -0.77936040  0.87439844 -0.68185855
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.048858189  0.83328339 -1.00682158
## indus    0.003990035 -0.21931535  0.38331944
## chas    -0.086306867 -0.12656293 -0.04743801
## nox      0.478610335 -0.57646152 -1.18754314
## rm      -0.104880542 -0.20537502 -0.16423037
## age      0.099636092 -0.39398842 -0.29155686
## dis     -0.071449010 -0.36719259  0.26316583
## rad      3.247540841  0.94122992  0.09360442
## tax      0.166515121 -0.08600031  0.37207968
## ptratio  0.126578325  0.07149853 -0.23515225
## black   -0.101558030  0.04222391  0.14005462
## lstat    0.323525351 -0.37031294  0.33322032
## medv     0.269793092 -0.44481921 -0.09086971
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9537 0.0348 0.0115
# the function for lda biplot arrows
# original? http://stackoverflow.com/questions/17232251/how-can-i-plot-a-biplot-for-lda-in-r
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
# the argument dimen chooses how many discriminants are used
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

In this model, LD1 explains 95.4 percent of the variance between the crime categories (proportion of trace), while LD2 and LD3 have only small effects. If I am interpreting the plot correctly, the LDA biplot shows that high crime rate town areas are separated from other areas by their index of accessibility to radial highways (variable rad).
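The proportion of trace reported by lda can be recomputed from the singular values stored in the fitted object (the svd component); a sketch using the built-in iris data rather than the train set above:

```r
library(MASS)
# The "proportion of trace" equals the squared singular values normalised
# to sum to one; demonstrated on the built-in iris data, not the Boston
# train set used above.
fit <- lda(Species ~ ., data = iris)
prop <- fit$svd^2 / sum(fit$svd^2)
round(prop, 4)
```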

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11      10        2    0
##   med_low    6      15        6    0
##   med_high   0       9       21    2
##   high       0       0        0   20

When compared to the test set, this LDA model classifies all high crime rate areas correctly. The largest confusions are 10 low crime areas classified as medium low and 9 medium high areas classified as medium low. Hence, the model predicts the high crime rate areas well, but some inaccuracies exist in the other categories. Apart from the low category, the majority of observations in each category fall into the correct class.
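The overall accuracy of the classification can be computed from the cross-tabulation: correct predictions sit on the diagonal. Using the counts from the table above:

```r
# Overall accuracy from the cross-tabulation above: correctly predicted
# observations are on the diagonal. Rows = correct class, cols = predicted.
confusion <- matrix(c(11, 6, 0, 0,
                      10, 15, 9, 0,
                      2, 6, 21, 0,
                      0, 0, 2, 20), nrow = 4)
accuracy <- sum(diag(confusion)) / sum(confusion)
round(accuracy, 2)
```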

K-means

Next, let’s try another statistical method on the dataset: k-means, an unsupervised clustering method that assigns observations to groups or clusters based on similarity. To do this, we reload and standardize the Boston dataset to get comparable data, and calculate the distances using the Euclidean distance measure from standard R tools.

# reload Boston and scale it
data('Boston')
boston_scaledk <- scale(Boston)
boston_scaledk <- as.data.frame(boston_scaledk) #otherwise matrix
# calculate euclidean distance matrix
dist_eu <- dist(boston_scaledk) # NB: dist takes time with large datasets
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000
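As a sanity check that dist() does what we claim, the Euclidean distance between two vectors is just the square root of the summed squared differences; a small hand-rolled example:

```r
# dist() defaults to the Euclidean measure: sqrt of summed squared differences
x <- c(1, 2, 3)
y <- c(4, 6, 3)
d <- as.numeric(dist(rbind(x, y)))
d                     # 5
sqrt(sum((x - y)^2))  # 5, the same by hand
```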

Next, the k-means function is used to run the clustering.

# k-means clustering
km <- kmeans(dist_eu, centers = 5)

# plot the Boston dataset with clusters
pairs(boston_scaledk, col = km$cluster)

The total within-cluster sum of squares (WCSS) is used to determine the optimal number of clusters. We use set.seed to make sure the same random numbers are generated on every run, which makes the different k’s comparable and ensures all results are reproducible.

set.seed(123) # to make sure different k's are comparable
k_max <- 10 # determine the max number of clusters

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})

# visualize the results
plot(1:k_max, twcss, type='b')

The WCSS plot shows that the optimal number of clusters in this dataset is 2: that is where the total sum drops most sharply. Therefore we run the final k-means clustering with 2 centers and visualize it.

# k-means clustering
km <- kmeans(dist_eu, centers = 2)

# plot the Boston dataset with clusters that didn't work out
# ggpairs(boston_scaledk, mapping = aes(col = cluster))

# plot the Boston dataset with clusters
pairs(boston_scaledk, col = km$cluster)

Conclusion

Both analyses point to the same conclusion: the suburbs of Boston divide into two groups based on crime rates, with high crime rate areas separating from the rest. LDA implies this difference is most clearly visible in the variable that measures accessibility to radial highways. Variables such as the full-value property-tax rate and the proportion of residential land zoned for lots over 25,000 sq.ft. also seem to matter. It would be interesting to investigate the effect of highways more closely and to find out, for instance, whether traffic violations are included in the data and distort the observations.

References:

(none)


Chapter 5: Dimensionality reduction

Again starting off by loading all the packages needed for this analysis:

library(dplyr)
library(ggplot2)
library(GGally)
library(stringr)
library(tidyr)
human <- read.csv("data/human.csv", row.names = 1, header=TRUE)

Data

The dataset used in this analysis lists all countries of the world, their Human Development Index (HDI), and other related statistical indicators. As UNDP states, “The Human Development Index (HDI) is a summary measure of average achievement in key dimensions of human development: a long and healthy life, being knowledgeable and having a decent standard of living. The HDI is the geometric mean of normalized indices for each of the three dimensions.”

For the purposes of this analysis the data has been combined with information from the Gender Inequality Index (GII). From the GII, two ratios were calculated and stored as variables: the ratio of female to male populations with at least secondary education, and the ratio of female to male labour force participation in each country. Finally, continent and world level summary rows were removed, as well as all observations with missing values (a prerequisite of the analysis method that follows).
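A minimal sketch of the wrangling steps described above, using made-up column names (edu2F, edu2M, labF, labM are illustrative only, not the actual names in the raw UNDP data):

```r
# toy stand-in for the raw data; the real wrangling lives in the data script
human_raw <- data.frame(edu2F = c(90, 50, NA), edu2M = c(89, 60, 70),
                        labF  = c(55, 30, 40), labM  = c(70, 75, 80))

# compute the two female/male ratios
human_raw$FM_Ratio_Edu <- human_raw$edu2F / human_raw$edu2M
human_raw$FM_Ratio_Lab <- human_raw$labF / human_raw$labM

# drop observations with missing values, as the analysis method requires
human_clean <- human_raw[complete.cases(human_raw), ]
nrow(human_clean)  # 2
```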

All the original data is collected by the United Nations Development Programme (more information here).

The modified dataset has 155 observations and 8 variables. See descriptions of the variables in Table 1 below.

Variable Description
GNI_Cap Gross National Income per capita
Life_Exp Life expectancy at birth
Edu_exp Expected years of education
MatMor Maternal mortality ratio
Ado.Birth Adolescent birth rate
Parlament_rep_F Percentage of female representatives in parliament
FM_Ratio_Edu Ratio of females to males with at least secondary education
FM_Ratio_Lab Ratio of females to males in the labour force

Table 1. Variables of the data

# explore data dimension and structures
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ FM_Ratio_Edu   : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ FM_Ratio_Lab   : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life_Exp       : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu_exp        : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI_Cap        : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ MatMor         : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth      : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parlament_rep_F: num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

Next, let’s use ggpairs to get a graphical overview of the data. Life expectancy and expected years of education are positively correlated. Maternal mortality and adolescent birth rate are heavily correlated, as are maternal mortality and life expectancy, which together indicate that giving birth is a risk to the mother’s health. Similarly, both life expectancy and expected years of education are negatively correlated with the adolescent birth rate and the maternal mortality ratio. Finally, GNI seems to correlate positively with life expectancy and with expected years of education.

ggpairs(human)

summary(human)
##   FM_Ratio_Edu     FM_Ratio_Lab       Life_Exp        Edu_exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##     GNI_Cap           MatMor         Ado.Birth      Parlament_rep_F
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
#library(corrplot)
#corrplot(human)
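The commented-out corrplot call above would fail as written, because corrplot() expects a correlation matrix rather than a raw data frame; a working sketch (assuming human is loaded as above):

```r
library(corrplot)

# pass the correlation matrix, not the raw data; visualizes the same
# pairwise correlations discussed above in one compact figure
corrplot(cor(human), type = "upper", tl.cex = 0.8)
```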

Next, let’s run Principal Component Analysis without scaling the data.

# perform principal component analysis (with the SVD method)
pca_human1 <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human1, choices = 1:2, cex = c(0.7, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Figure 1. A PCA biplot showing world countries on two dimensions. The data is unscaled, which makes the plot heavily distorted by the variance of GNI.

As the plot implies, PCA doesn’t exactly work with unscaled data, as the result is heavily distorted by the GNI variable. PCA assumes variables with larger variance are more important than variables with smaller variance (Nieminen & Kämäräinen, 2017). Hence, let’s make a better analysis by scaling all the variables.

# standardize the variables
human_std <- scale(human)

# perform principal component analysis (with the SVD method)
pca_human2 <- prcomp(human_std)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human2, choices = 1:2, cex = c(0.7, 1), col = c("grey40", "deeppink2"), main = " ", sub = "Figure 2.")
Figure 2. A PCA biplot showing world countries on two dimensions. PC1 = standard of living (negative), PC2 = gender equality in politics and work.

The plot indicates that the two main component dimensions are organized in relation to the original variables on two axes: the horizontal principal component ranges from high-education, high-life-expectancy, high-welfare countries to countries characterized by high adolescent birth rates and high maternal mortality; the vertical dimension is defined by the ratio of females in the labour force and the share of women in parliament, two variables that had only a small correlation with the rest.

PC1 can thus be interpreted as negatively representing the general standard of living, measured by mortality/health rates and national income, ranging from developed countries on the left side of the plot to developing countries on the right. PC2, orthogonal to PC1, indicates the status of women in society, measured by the ratio of women in parliament and in working life.
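How much of the total variance these two components actually capture can be read off the fitted object; a sketch assuming pca_human2 from above:

```r
# proportion of total variance explained by each principal component
pca_summary <- summary(pca_human2)
round(100 * pca_summary$importance["Proportion of Variance", ], 1)
```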

Let’s have some tea next!

Then to something completely different. Next we will explore people’s tea-drinking habits using the tea dataset from the FactoMineR package. Let’s first explore the data a bit: it has 300 observations and 36 variables.

library(FactoMineR)
data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
#summary(tea)

We notice that there is one integer variable in the data (age). Since we will soon use Multiple Correspondence Analysis, which accepts only categorical variables, we will transform the integer variable age into a factor variable with 10-year levels.

tea$age2 <- cut(tea$age, breaks = c(15, 25, 35, 45, 55, 65, 75, 85))
tea_ <- dplyr::select(tea, -age)

After this, the royal we notices that a factorized variable age_Q already exists in the data. Oh well.

tea_ <- dplyr::select(tea_, -age2) #overlyhonestmethods

In any case, the full dataset is rather large for visualization, so let’s pick a few interesting variables to investigate.

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "age_Q", "where", "friends", "sugar", "price", "tea.time", "sex")

# select the 'keep_columns' to create a new dataset tea_time
tea_time <- dplyr::select(tea_, one_of(keep_columns))

gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped

Next we will run a Multiple Correspondence Analysis (MCA) on the tea data, first with the full dataset and then with some selected columns.

# multiple correspondence analysis
mca <- MCA(tea_, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.090   0.082   0.070   0.063   0.056   0.053
## % of var.              5.838   5.292   4.551   4.057   3.616   3.465
## Cumulative % of var.   5.838  11.130  15.681  19.738  23.354  26.819
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.050   0.048   0.047   0.044   0.041   0.040
## % of var.              3.272   3.090   3.053   2.834   2.643   2.623
## Cumulative % of var.  30.091  33.181  36.234  39.068  41.711  44.334
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.039   0.037   0.036   0.035   0.034   0.032
## % of var.              2.531   2.388   2.302   2.275   2.172   2.085
## Cumulative % of var.  46.865  49.252  51.554  53.829  56.000  58.086
##                       Dim.19  Dim.20  Dim.21  Dim.22  Dim.23  Dim.24
## Variance               0.031   0.031   0.030   0.028   0.027   0.026
## % of var.              2.013   2.011   1.915   1.847   1.740   1.686
## Cumulative % of var.  60.099  62.110  64.025  65.872  67.611  69.297
##                       Dim.25  Dim.26  Dim.27  Dim.28  Dim.29  Dim.30
## Variance               0.025   0.025   0.024   0.024   0.023   0.022
## % of var.              1.638   1.609   1.571   1.524   1.459   1.425
## Cumulative % of var.  70.935  72.544  74.115  75.639  77.099  78.523
##                       Dim.31  Dim.32  Dim.33  Dim.34  Dim.35  Dim.36
## Variance               0.021   0.020   0.020   0.019   0.019   0.018
## % of var.              1.378   1.322   1.281   1.241   1.222   1.152
## Cumulative % of var.  79.901  81.223  82.504  83.745  84.967  86.119
##                       Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               0.017   0.017   0.016   0.015   0.015   0.014
## % of var.              1.092   1.072   1.019   0.993   0.950   0.924
## Cumulative % of var.  87.211  88.283  89.301  90.294  91.244  92.169
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48
## Variance               0.014   0.013   0.012   0.011   0.011   0.010
## % of var.              0.891   0.833   0.792   0.729   0.716   0.666
## Cumulative % of var.  93.060  93.893  94.684  95.414  96.130  96.796
##                       Dim.49  Dim.50  Dim.51  Dim.52  Dim.53  Dim.54
## Variance               0.010   0.009   0.009   0.008   0.007   0.006
## % of var.              0.660   0.605   0.584   0.519   0.447   0.390
## Cumulative % of var.  97.456  98.060  98.644  99.163  99.610 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1             | -0.580  1.246  0.174 |  0.155  0.098  0.012 |  0.052
## 2             | -0.376  0.522  0.108 |  0.293  0.350  0.066 | -0.164
## 3             |  0.083  0.026  0.004 | -0.155  0.099  0.015 |  0.122
## 4             | -0.569  1.196  0.236 | -0.273  0.304  0.054 | -0.019
## 5             | -0.145  0.078  0.020 | -0.142  0.083  0.019 |  0.002
## 6             | -0.676  1.693  0.272 | -0.284  0.330  0.048 | -0.021
## 7             | -0.191  0.135  0.027 |  0.020  0.002  0.000 |  0.141
## 8             | -0.043  0.007  0.001 |  0.108  0.047  0.009 | -0.089
## 9             | -0.027  0.003  0.000 |  0.267  0.291  0.049 |  0.341
## 10            |  0.205  0.155  0.028 |  0.366  0.546  0.089 |  0.281
##                  ctr   cos2  
## 1              0.013  0.001 |
## 2              0.127  0.021 |
## 3              0.071  0.009 |
## 4              0.002  0.000 |
## 5              0.000  0.000 |
## 6              0.002  0.000 |
## 7              0.095  0.015 |
## 8              0.038  0.006 |
## 9              0.553  0.080 |
## 10             0.374  0.052 |
## 
## Categories (the 10 first)
##                  Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test
## breakfast     |  0.182  0.504  0.031  3.022 |  0.020  0.007  0.000  0.330
## Not.breakfast | -0.168  0.465  0.031 -3.022 | -0.018  0.006  0.000 -0.330
## Not.tea time  | -0.556  4.286  0.240 -8.468 |  0.004  0.000  0.000  0.065
## tea time      |  0.431  3.322  0.240  8.468 | -0.003  0.000  0.000 -0.065
## evening       |  0.276  0.830  0.040  3.452 | -0.409  2.006  0.087 -5.109
## Not.evening   | -0.144  0.434  0.040 -3.452 |  0.214  1.049  0.087  5.109
## lunch         |  0.601  1.678  0.062  4.306 | -0.408  0.854  0.029 -2.924
## Not.lunch     | -0.103  0.288  0.062 -4.306 |  0.070  0.147  0.029  2.924
## dinner        | -1.105  2.709  0.092 -5.240 | -0.081  0.016  0.000 -0.386
## Not.dinner    |  0.083  0.204  0.092  5.240 |  0.006  0.001  0.000  0.386
##                  Dim.3    ctr   cos2 v.test  
## breakfast     | -0.107  0.225  0.011 -1.784 |
## Not.breakfast |  0.099  0.208  0.011  1.784 |
## Not.tea time  |  0.062  0.069  0.003  0.950 |
## tea time      | -0.048  0.054  0.003 -0.950 |
## evening       |  0.344  1.653  0.062  4.301 |
## Not.evening   | -0.180  0.864  0.062 -4.301 |
## lunch         |  0.240  0.343  0.010  1.719 |
## Not.lunch     | -0.041  0.059  0.010 -1.719 |
## dinner        |  0.796  1.805  0.048  3.777 |
## Not.dinner    | -0.060  0.136  0.048 -3.777 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.031 0.000 0.011 |
## tea.time      | 0.240 0.000 0.003 |
## evening       | 0.040 0.087 0.062 |
## lunch         | 0.062 0.029 0.010 |
## dinner        | 0.092 0.000 0.048 |
## always        | 0.056 0.035 0.007 |
## home          | 0.016 0.002 0.030 |
## work          | 0.075 0.020 0.022 |
## tearoom       | 0.321 0.019 0.031 |
## friends       | 0.186 0.061 0.030 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

# plot(mca, invisible=c("ind","var"))

The plot for the full dataset is rather difficult to interpret, since all variables are gathered near the center of the plot. Let’s move on by using just the selected variables and try to investigate how differently aged individuals use tea.

# removing some un-interesting variables
tea_time2 <- dplyr::select(tea_time, -price, -How)

# multiple correspondence analysis
mca_sub <- MCA(tea_time2, graph = FALSE)

# summary of the model
summary(mca_sub)
## 
## Call:
## MCA(X = tea_time2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.253   0.237   0.184   0.159   0.147   0.135
## % of var.             14.737  13.813  10.715   9.278   8.567   7.866
## Cumulative % of var.  14.737  28.550  39.265  48.543  57.111  64.977
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.123   0.111   0.110   0.101   0.079   0.077
## % of var.              7.173   6.468   6.405   5.897   4.610   4.471
## Cumulative % of var.  72.150  78.618  85.023  90.920  95.529 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                    |  0.500  0.329  0.110 |  0.384  0.208  0.065 |
## 2                    | -0.006  0.000  0.000 |  0.692  0.673  0.272 |
## 3                    | -0.514  0.349  0.232 |  0.049  0.003  0.002 |
## 4                    |  0.681  0.611  0.357 | -0.476  0.319  0.175 |
## 5                    |  0.441  0.257  0.128 |  0.337  0.160  0.075 |
## 6                    |  0.453  0.271  0.161 | -0.169  0.040  0.022 |
## 7                    | -0.236  0.074  0.034 |  0.049  0.003  0.001 |
## 8                    | -0.480  0.304  0.112 |  0.625  0.549  0.190 |
## 9                    | -0.293  0.113  0.040 |  0.308  0.134  0.044 |
## 10                   |  0.013  0.000  0.000 |  0.729  0.749  0.207 |
##                       Dim.3    ctr   cos2  
## 1                     0.142  0.036  0.009 |
## 2                    -0.848  1.306  0.409 |
## 3                    -0.573  0.596  0.289 |
## 4                    -0.274  0.136  0.058 |
## 5                    -0.691  0.866  0.315 |
## 6                    -0.469  0.400  0.172 |
## 7                     0.222  0.089  0.030 |
## 8                    -0.202  0.074  0.020 |
## 9                     0.409  0.304  0.078 |
## 10                    0.399  0.289  0.062 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black                |  -0.314   1.373   0.032  -3.104 |   0.958  13.661
## Earl Grey            |  -0.035   0.045   0.002  -0.817 |  -0.540  11.331
## green                |   0.909   5.142   0.102   5.527 |   1.012   6.791
## 15-24                |  -0.161   0.448   0.011  -1.848 |  -1.008  18.812
## 25-34                |   0.980  12.503   0.287   9.266 |  -0.080   0.090
## 35-44                |  -0.519   2.031   0.041  -3.520 |   0.423   1.437
## 45-59                |  -0.204   0.480   0.011  -1.784 |   0.714   6.260
## +60                  |  -0.517   1.916   0.039  -3.406 |   0.996   7.575
## chain store          |   0.181   1.185   0.058   4.171 |  -0.161   1.003
## chain store+tea shop |  -0.733   7.890   0.189  -7.508 |  -0.031   0.016
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                  0.301   9.480 |   0.176   0.591   0.010   1.737 |
## Earl Grey              0.527 -12.548 |  -0.003   0.000   0.000  -0.064 |
## green                  0.126   6.149 |  -0.378   1.219   0.018  -2.295 |
## 15-24                  0.450 -11.596 |  -0.382   3.478   0.064  -4.391 |
## 25-34                  0.002  -0.760 |   0.569   5.796   0.097   5.380 |
## 35-44                  0.027   2.867 |   0.686   4.882   0.072   4.654 |
## 45-59                  0.130   6.241 |  -1.046  17.315   0.279  -9.141 |
## +60                    0.144   6.557 |   0.848   7.090   0.104   5.587 |
## chain store            0.046  -3.716 |  -0.488  11.871   0.424 -11.260 |
## chain store+tea shop   0.000  -0.323 |   0.871  15.357   0.267   8.932 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## Tea                  | 0.116 0.527 0.023 |
## age_Q                | 0.307 0.566 0.496 |
## where                | 0.216 0.141 0.424 |
## friends              | 0.116 0.129 0.144 |
## sugar                | 0.160 0.273 0.086 |
## tea.time             | 0.450 0.001 0.011 |
## sex                  | 0.404 0.020 0.102 |
# visualize MCA
plot(mca_sub, invisible=c("ind"), habillage = "quali")

# plot(mca, invisible=c("ind","var"))

After several rounds of testing different variables no eureka findings emerged, but the final plot above indicates some patterns: respondents drink green tea alone and Earl Grey with friends. Sugar and Earl Grey tend to go together. Green tea is bought from a tea shop.

Regarding the demographic variables, it seems that younger respondents don’t care about any particular tea time, whereas older generations do. Teenagers like Earl Grey with sugar; the middle-aged and older drink black tea without sugar. Men tend to drink tea alone, women with friends.

Over and out!